Implement Event Collection API using Spanner #1811

keyurva merged 7 commits into datacommonsorg:master
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Event Collection API, enabling the retrieval of event data from Spanner. It achieves feature parity with the existing V2 implementation, allowing users to query events by type, location, and date, with support for additional filtering capabilities.

Highlights
Changelog
Activity
Code Review
This pull request implements the Event Collection API using Spanner, achieving feature parity with the V2 implementation. The changes are well-structured, with a clear separation of concerns for query building, execution, and data processing. The addition of comprehensive tests and golden files is commendable. I've identified a couple of areas for improvement related to deterministic behavior and code clarity, aligning with best practices for robust and maintainable code.
for k, v := range arc.Filter {
	if k == "typeOf" || k == "date" {
		continue
	}
	if len(v) != 1 {
		return nil // Invalid filter
	}
	spec, err := v2e.ParseEventCollectionFilter(k, v[0])
	if err != nil {
		return nil // Invalid filter format
	}
	res.FilterProp = spec.Prop
	res.FilterLowerLimit = spec.LowerLimit
	res.FilterUpperLimit = spec.UpperLimit
	res.FilterUnit = spec.Unit
	break // V2 supports at most one extra filter
}
The iteration over the arc.Filter map is non-deterministic because map iteration order is not guaranteed in Go. While the comment on line 436 mentions that V2 supports at most one extra filter, relying on this assumption can make the code fragile. If a request with multiple extra filters is ever processed, the behavior will be unpredictable as to which filter is applied. To make the implementation more robust and deterministic, I suggest sorting the filter keys before iterating.
// Handle standard filters (e.g. area).
filterKeys := make([]string, 0, len(arc.Filter))
for k := range arc.Filter {
filterKeys = append(filterKeys, k)
}
sort.Strings(filterKeys)
for _, k := range filterKeys {
if k == "typeOf" || k == "date" {
continue
}
v := arc.Filter[k]
if len(v) != 1 {
return nil // Invalid filter
}
spec, err := v2e.ParseEventCollectionFilter(k, v[0])
if err != nil {
return nil // Invalid filter format
}
res.FilterProp = spec.Prop
res.FilterLowerLimit = spec.LowerLimit
res.FilterUpperLimit = spec.UpperLimit
res.FilterUnit = spec.Unit
break // V2 supports at most one extra filter
}

References
- This rule emphasizes ensuring deterministic order when the underlying system (like map iteration or SPARQL query results without ORDER BY) does not guarantee it. Sorting map keys before iteration aligns with this principle to prevent unpredictable behavior.
func populateGeoLocation(event *pbv1.EventCollection_Event, value string) {
	// Note: The startLocation value in Spanner is usually a latLong/ DCID (e.g. latLong/577521_-958960).
	// We parse it here for performance to avoid an extra database roundtrip.
	//
	// TODO(task): Revisit this optimization if we encounter valid startLocation values
	// that are NOT latLong/ DCIDs but still need to be resolved to points, or if the
	// assumption that dcids always contain coordinates is not true.
	if strings.HasPrefix(value, "latLong/") {
		parts := strings.Split(strings.TrimPrefix(value, "latLong/"), "_")
		if len(parts) == 2 {
			lat, err1 := strconv.ParseFloat(parts[0], 64)
			lon, err2 := strconv.ParseFloat(parts[1], 64)
			if err1 == nil && err2 == nil {
				event.GeoLocations = append(event.GeoLocations, &pbv1.EventCollection_GeoLocation{
					Geo: &pbv1.EventCollection_GeoLocation_Point_{
						Point: &pbv1.EventCollection_GeoLocation_Point{
							Latitude:  proto.Float64(lat / 100000.0),
							Longitude: proto.Float64(lon / 100000.0),
						},
					},
				})
			}
		}
	}
}
This function uses several magic strings and numbers ("latLong/", "_", 100000.0). These should be defined as constants at the package level to improve readability and maintainability. This makes it clear what these values represent and makes them easier to change if needed. Since I can only suggest changes within the function, I'll add the scaling factor as a local constant.
func populateGeoLocation(event *pbv1.EventCollection_Event, value string) {
const latLonScalingFactor = 100000.0
// Note: The startLocation value in Spanner is usually a latLong/ DCID (e.g. latLong/577521_-958960).
// We parse it here for performance to avoid an extra database roundtrip.
//
// TODO(task): Revisit this optimization if we encounter valid startLocation values
// that are NOT latLong/ DCIDs but still need to be resolved to points, or if the
// assumption that dcids always contain coordinates is not true.
if strings.HasPrefix(value, "latLong/") {
parts := strings.Split(strings.TrimPrefix(value, "latLong/"), "_")
if len(parts) == 2 {
lat, err1 := strconv.ParseFloat(parts[0], 64)
lon, err2 := strconv.ParseFloat(parts[1], 64)
if err1 == nil && err2 == nil {
event.GeoLocations = append(event.GeoLocations, &pbv1.EventCollection_GeoLocation{
Geo: &pbv1.EventCollection_GeoLocation_Point_{
Point: &pbv1.EventCollection_GeoLocation_Point{
Latitude: proto.Float64(lat / latLonScalingFactor),
Longitude: proto.Float64(lon / latLonScalingFactor),
},
},
})
}
}
}
}
n-h-diaz
left a comment
Thanks for getting this done so quickly! :)
generally looks good, just wanted to clarify a few comments on how dates are handled
on geoJson escaping, it looks like current v2 does the single escape, which is how we've handled throughout for v3, so I think okay to keep for now. we can change escaping later if needed
@{force_join_order=true}
GRAPH DCGraph MATCH (event:Node)-[:Edge {predicate: 'affectedPlace', object_id: 'country/LBR'}]->(), (event)-[:Edge {predicate: 'typeOf', object_id: 'FireEvent'}]->(), (event)-[:Edge {predicate: 'startDate'}]->(dateNode:Node)
WHERE
SUBSTR(dateNode.value, 1, 7) = '2020-10'
what happens if the event occurs over multiple months? (ie. do we want the events to be returned if requesting another month? not sure how this is handled in current v2, but we should make sure we match the current behavior)
As I understood it, the event collection date responses return dates at a monthly granularity: https://screenshot.googleplex.com/7G7DVYoE27mgTEi
Following which the frontend requests event collection at the same (monthly) granularity.
So to your question, if an event occurs over multiple months, multiple requests are made one for each month. We can discuss if that's what we want moving forward but what we currently have gets us to feature parity.
Oh I meant that if we request events from an event's "endDate" is it still expected to return - but looks like it's only for startDates though, so no change required I think!
// Pattern: <-location{typeOf:EventType, date:Date, filter_prop:filter_val}
func parseEventCollection(req *pbv2.EventRequest, arcs []*v2.Arc) *pbv1.EventCollectionRequest {
	if len(arcs) != 1 {
		return nil
(optional, since not backward supported: could we add some informative error messages to help understand the various reasons a request didn't parse)
if !strings.HasPrefix(edge.Value, "s2CellId/") {
	event.Places = append(event.Places, edge.Value)
}
case predStartDate:
just to confirm, how is event.Dates populated? should we also add observationDates? or does this just copy from startDate? (example: https://datacommons.org/browser/fireEvent/2022-03-05_0x4737f90000000000)
(we should just match whatever v2 does for this)
Yeah, I was trying to match v2 and the values seem to be the start dates. Happy to change if it's supposed to be observationDates. LMK.
Yeah it seems like just startDates, so looks good
	},
}

var eventCollectionDcidsTestCases = []struct {
optional: since events support flexible schema, could we test a few other event types? (since we've hardcoded a few properties in the queries, want to confirm they'll work for other events too)
Definitely! Do you have any other event types handy? If not, I can try to look them up.
I think maybe at least check some of the ones on the disaster page: https://datacommons.org/disasters/Earth (it looks like there are a lot more actual "%Event" types, but not sure if we ever surface them)
Ah ok. Added test cases for another event type (FloodEvent).
// TODO(task): Revisit this optimization if we encounter valid startLocation values
// that are NOT latLong/ DCIDs but still need to be resolved to points, or if the
// assumption that dcids always contain coordinates is not true.
if strings.HasPrefix(value, "latLong/") {
optional: since we're making this assumption, could we log an error if we find any examples otherwise? (so we can track and update if needed)
keyurva
left a comment
Thanks for the review!
n-h-diaz
left a comment
Thanks!
something to consider while testing (can be follow up) is how we'd like to handle large responses - it looks like we do some truncation for prophet, we should see what spanner can support
Known Discrepancies with V2
- `propVals` order: The order of keys in `propVals` is different from V2.
- `geoJsonCoordinates` escaping: In the V2 response, quotes are triple-escaped (`\\\"`), whereas in the Spanner response they are single-escaped (`\"`).

V2 geoJsonCoordinates

V3 geoJsonCoordinates

@n-h-diaz - let me know if this can cause issues and I can change the V3 escaping.